ComMap: a software to perform large-scale structure-based mapping for cross-linking mass spectrometry

Abstract
Motivation: Chemical cross-linking combined with mass spectrometry (CXMS) is now a well-established method for profiling protein–protein interactions (PPIs) with partially known structures. Mapping CXMS results onto existing structure databases is expected to help reveal protein dynamics during structural analysis. However, currently available structure-based analysis software is difficult to apply on a large scale. In addition, because no global measure of dynamic structure-mapping results exists, large-scale structural analysis and data mining remain infeasible.
Results: ComMap (protein complex structure mapping) is software designed to perform large-scale structure-based mapping by integrating CXMS data with existing structures. It completes the distance calculation for PPIs against existing structures in batch within minutes and scores the different PPI-structure pairs, offering a global view of testable hypotheses about structural dynamism.
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Chemical cross-linking combined with mass spectrometry (CXMS) has been widely applied to the study of protein interactions and structures in recent years (Wheat et al., 2021). In CXMS, cross-linkers covalently link two residues to form an informative linkage between proteins, which can be identified from MS/MS spectra by dedicated software (Yu and Huang, 2018). By mapping cross-linking results onto existing protein structures, the structural information carried by the cross-links can be delineated and used for subsequent structural analysis of protein complexes or for protein docking. However, current structure-mapping software can neither perform structural analysis on a large scale nor provide a global measure of dynamic structure-mapping results. For instance, proXL (Riffle et al., 2016) and xiView (Graham et al., 2019) can load only a single user-defined Protein Data Bank file at a time, while other web programs designed for batch processing, such as XLinkDB, focus on structural mapping of specific Lys-Lys cross-linkers (Schweppe et al., 2016), so site-non-specific cross-linkers are not supported (Zhang et al., 2022). Moreover, CXMS results carry dynamic information: 'over-length' cross-links may arise from an alternative excited-state conformation of the protein complex (Ding et al., 2017). Recently, a comprehensive structure-based evaluation of different search tools has been performed (Yugandhar et al., 2020). Unfortunately, current structure-mapping tools do not take this information into account. As a result, researchers can perform dynamic analysis only for a few proteins at a time, without a comparable scoring system, and cannot obtain a global view of the dynamism profile.
Lastly, current protein-interaction docking and structural simulations based on cross-linking information rely on manually filtered cross-links, without algorithmic discrimination, which may introduce excessive subjectivity into model construction (Mintseris and Gygi, 2020). It is therefore infeasible for the CXMS community to rapidly perform large-scale structure-based analysis and data mining in biological systems.
Herein, we propose ComMap, software designed to perform large-scale structure-based mapping of CXMS data. First, ComMap can read thousands of protein structure files and complete the distance calculation against existing structures in minutes. Second, ComMap is not restricted to specific cross-linking protocols and supports many kinds of cross-linkers.
Moreover, ComMap identifies known static PPIs and measures testable hypothetical dynamism of protein structure across diverse PPI-structure pairs. Hence, biologists can now perform structural dynamic analysis and protein docking for thousands of PPIs under a comparable scoring standard.

Implementation
ComMap is implemented in Python 3.8 and includes three steps: data preparation, distance analysis and results export (Fig. 1A).

Data preparation
First, ComMap reads cross-linking results generated by pLink2 or SpotLink, or supplied as generic PPI files, together with the corresponding FASTA file. Next, ComMap reads Protein Data Bank (PDB) structure files in mmCIF format, either local or online, according to the user configuration (detailed description in Supplementary Note S1).
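As an illustration of the kind of record the data preparation step produces, the following minimal sketch parses a generic cross-link list into typed records. The comma-separated layout, the column names (`protein_a`, `site_a`, `protein_b`, `site_b`), the `CrossLink` class and the `read_cross_links` function are all assumptions made for this example; they do not reproduce the pLink2, SpotLink or ComMap file formats.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass(frozen=True)
class CrossLink:
    """One cross-linked residue pair (hypothetical record layout)."""
    protein_a: str
    site_a: int
    protein_b: str
    site_b: int


def read_cross_links(handle: TextIO) -> List[CrossLink]:
    """Read cross-links from a CSV handle with the assumed column names."""
    reader = csv.DictReader(handle)
    return [
        CrossLink(
            protein_a=row["protein_a"],
            site_a=int(row["site_a"]),      # residue position in protein A
            protein_b=row["protein_b"],
            site_b=int(row["site_b"]),      # residue position in protein B
        )
        for row in reader
    ]


# Illustrative input with placeholder protein names, not real identifications.
example = io.StringIO(
    "protein_a,site_a,protein_b,site_b\n"
    "PROT_A,12,PROT_B,45\n"
    "PROT_A,30,PROT_A,88\n"
)
links = read_cross_links(example)
```

Keeping the residue positions as integers at this stage makes the later lookup of coordinates by residue number straightforward.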

Distance analysis
ComMap then aligns the sequences from the protein structures with the identified protein sequences using the BLAST tool. After that, ComMap calculates the residue distance for each PPI-structure pair. ComMap subsequently scores the PPI-structure pairs based on several features; the scores reflect structural dynamism from a global view (detailed description in Supplementary Note S1).
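The residue distance calculation can be sketched as follows. This is a minimal example under stated assumptions, not ComMap's implementation: it assumes that the distance is the Euclidean distance between C-alpha atoms, that coordinates have already been extracted into a dictionary keyed by chain and residue number, and that the cross-linker's maximum span is supplied by the caller. The names `residue_distance`, `classify_link` and `max_span` are hypothetical, and the ComMap score itself (described in Supplementary Note S1) is not reproduced here.

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float, float]   # (x, y, z) in angstroms
ResidueKey = Tuple[str, int]         # (chain ID, residue number)


def residue_distance(
    coords: Dict[ResidueKey, Coord], a: ResidueKey, b: ResidueKey
) -> float:
    """Euclidean distance between two residues' assumed C-alpha coordinates."""
    return math.dist(coords[a], coords[b])


def classify_link(distance: float, max_span: float) -> str:
    """Label a cross-link relative to an assumed maximum cross-linker span."""
    return "within" if distance <= max_span else "over-length"


# Toy coordinates chosen so that the distance is easy to check by hand.
coords = {("A", 1): (0.0, 0.0, 0.0), ("B", 5): (3.0, 4.0, 0.0)}
d = residue_distance(coords, ("A", 1), ("B", 5))
label = classify_link(d, max_span=4.0)
```

In this toy case the two residues are 5.0 angstroms apart, so the link is labelled over-length for a 4.0 angstrom limit; the limit is used only to illustrate the comparison and is not a real cross-linker parameter.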

Results export
Finally, ComMap generates three output files: a comprehensive file that stores all PPI and structural distance information, a file of PPIs categorized by structure entry, and a ComMap score file for PPI-structure pairs. PyMOL scripts for visualizing the protein structures can easily be obtained from these files.
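To show what a PyMOL script for visualizing cross-links might look like, the sketch below builds one from a list of residue pairs. It is an assumption-laden illustration and does not reproduce the scripts that ComMap writes: the function name `pymol_script`, the tuple layout of the input and the `xl_` object names are hypothetical, and the structure identifier `1abc` is a placeholder. Only the standard PyMOL commands `fetch`, `hide`, `show` and `distance` are used.

```python
from typing import Iterable, Tuple

# (chain A, residue number A, chain B, residue number B); hypothetical layout.
LinkSpec = Tuple[str, int, str, int]


def pymol_script(pdb_id: str, links: Iterable[LinkSpec]) -> str:
    """Return PyMOL script text that draws one distance object per cross-link."""
    lines = [
        f"fetch {pdb_id}",        # download and load the structure
        "hide everything",
        "show cartoon",
    ]
    for i, (chain_a, resi_a, chain_b, resi_b) in enumerate(links, start=1):
        # Select the C-alpha atom of each cross-linked residue.
        sel_a = f"chain {chain_a} and resi {resi_a} and name CA"
        sel_b = f"chain {chain_b} and resi {resi_b} and name CA"
        lines.append(f"distance xl_{i}, {sel_a}, {sel_b}")
    return "\n".join(lines) + "\n"


script = pymol_script("1abc", [("A", 10, "B", 20), ("A", 33, "A", 71)])
```

The returned text can be saved with a `.pml` extension and run in PyMOL, where each `xl_` object appears as a dashed line labelled with the measured distance.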

Results
Three PPI datasets were analyzed with ComMap as illustrations: the BSA dataset, the proteasome dataset and the K562 cell dataset.
To demonstrate that ComMap can analyze data generated with multiple types of cross-linkers, we first analyzed the BSA dataset (Iacobucci et al., 2019), which involves the DSS, BS3 and DSBU cross-linkers (detailed description in Supplementary Note S2). The ComMap score distribution in this dataset is concentrated in the high-scoring region, indicating relatively static interactions, which is consistent with the characteristics of BSA. In addition, we provide a step-by-step user guide for ComMap based on the BSA dataset.
To demonstrate the performance of the ComMap score on a target protein complex, we obtained and analyzed the proteasome dataset from a Saccharomyces cerevisiae sample (PXD011296) (Mintseris and Gygi, 2020). ComMap obtained distance information for 616 proteasome interactions from 666 protein structures and generated a high-density proteasome PPI-structure mapping. Analysis of the ComMap scores showed that most interactions in this dataset are relatively static (Fig. 1B). In-depth investigation of these interactions indicated that ComMap is a reliable tool for analyzing protein cross-linking and structural dynamics (detailed description in Supplementary Note S3).
ComMap was then used to explore Homo sapiens PPIs. We obtained and analyzed the human cell cross-linking dataset from a K562 cell sample (PXD018771) (Yugandhar et al., 2020). ComMap obtained distance information for 2651 interactions from 8284 protein structures. The ComMap score distribution on this dataset was similar to that on the proteasome dataset (Fig. 1C). An example of PPI mapping onto human 80S ribosomes with ComMap from this dataset is illustrated (Fig. 1D, detailed structure information in Supplementary Materials). Additionally, we derived testable hypotheses about the dynamic properties of several protein complexes, including the ribosome, GTP-binding nuclear protein and calmodulin (detailed description in Supplementary Note S4).

Conclusions
In summary, ComMap is a valuable and flexible tool for structure-based measurement of dynamism in CXMS data, ranging from a target protein complex to the proteome. At present, however, cross-linking-based structural dynamism analysis can only be performed against structures recorded in the PDB. Machine learning methods could be introduced into the scoring system in the future, which may reduce the dependence on recorded structures. With sufficient training data from accumulating datasets, ComMap will greatly facilitate studies of protein structure and PPIs, as well as CXMS-based protein docking.